Lexicon Effects on Chinese Information Retrieval

نویسنده

  • Kui-Lam Kwok
چکیده

We investigate the effects of lexicon size and stopwords on Chinese information retrieval using our method of short-word segmentation based on simple language usage rules and statistics. These rules allow us to employ a small lexicon of only 2,175 entries and provide quite admirable retrieval results. It is noticed that accurate segmentation is not essential for good retrieval. Larger lexicons can lead to incremental improvements. The presence of stopwords do not contribute much noise to IR. Their removal risks elimination of crucial words in a query and adversely affect retrieval, especially when the queries are short. Short queries of a few words perform more than 10% worse than paragraph-size queries.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Construction of a Chinese-english Verb Lexicon for Embedded Machine Translation in Cross-language Information Retrieval

This paper addresses the problem of automatic acquisition of lexical knowledge for rapid construction of MT engines multilingual applications. We describe new techniques for large-scale construction of a Chinese-English verb lexicon and we evaluate the coverage and eeectiveness of the resulting lexicon for a structured MT approach that is embedded in a cross-language information retrieval syste...

متن کامل

Building a Chinese-English Mapping between Verb Concepts for Multilingual Applications

This paper addresses the problem of building conceptual resources for multilingual applications. We describe new techniques for large-scale construction of a Chinese-English lexicon for verbs, using thematic-role information to create links between Chinese and English conceptual information. We then present an approach to compensating for gaps in the existing resources. The resulting lexicon is...

متن کامل

Chinese-English Semantic Resource Construction

We describe an approach to large-scale construction of a semantic lexicon for Chinese verbs. We leverage off of three existing resources— a classification of English verbs called EVCA (English Verbs Classes and Alternations) (Levin, 1993), a Chinese conceptual database called HowNet (Zhendong, 1988c; Zhendong, 1988b; Zhendong, 1988a) (http://www.how-net.com), and a large machine-readable dictio...

متن کامل

Phrase Alignment Based on Combination of Multiple Strategies

Phrase translation pairs are very useful for bilingual lexicography, machine translation system, crosslingual information retrieval and many applications in natural language processing. There is phrase boundary information in parsing trees of sentences. Linguistics knowledge in translation lexicon and semantic lexicon, and statistics results from bilingual corpus can be used to align Chinese wo...

متن کامل

TREC-9 CLIR Experiments at MSRCN

In TREC-9, we participated in the English-Chinese Cross-Language Information Retrieval (CLIR) track. Our work involved two aspects: finding good methods for Chinese IR, and finding effective translation means between English and Chinese. On Chinese monolingual retrieval, we investigated the use of different entities as indexes, pseudorelevance feedback, and length normalization, and examined th...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1997